class: center, middle, inverse, title-slide

# Principal Component Analysis

---

# Why reduce dimensions?

High-dimensional data are difficult to analyse:

- many variables (p)
- difficult to visualise and intuit
- sometimes no clear response variable.

--

Reducing dimensionality can retain (most of) the original information while making the data easier to understand and work with.

--

Principal Component Analysis (PCA) is one such method: it identifies the main sources of variation in a dataset.

---

# What is PCA (informally)?

PCA finds linear combinations of the input features that explain a large amount of the variation in the data, combining them into new features called "principal components" (PCs).

--

The first PC explains the largest possible share of the variation.

--

Each subsequent PC is uncorrelated with the PCs before it.

--

The contribution of each feature to each PC is measured by its loading.

---

# What is PCA (formally)?

The first principal component, `\(Z_1\)`, is calculated as:

`\(Z_1 = a_{11} X_1 + a_{21} X_2 + \dots + a_{p1} X_p\)`

where `\(X_1, \dots, X_p\)` are the features in the original dataset and `\(a_{11}, \dots, a_{p1}\)` are weights, or loadings.

---

# PCA intuitively

<img src="/home/alan/Documents/github/carpentries/high-dimensional-stats-r/fig/pendulum.gif" width="500px" />

---

# PCA is a rotation of the data

<img src="/home/alan/Documents/github/carpentries/high-dimensional-stats-r/fig/bio_index_vs_percentage_fallow.png" width="700px" />
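
---

# PCA by hand (sketch)

To make the formal definition concrete, here is a minimal numerical sketch. The lesson itself works in R (e.g. with `prcomp`); this illustration uses Python/NumPy, and the toy data and variable names are invented here. The loadings `\(a_{11}, \dots, a_{p1}\)` are the entries of the leading eigenvector of the covariance matrix, and `\(Z_1\)` is the projection of the centred data onto that vector.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 100 observations of p = 3 features, with correlation induced
# between the first two so that PC1 captures most of the variation.
X = rng.normal(size=(100, 3))
X[:, 1] += 2 * X[:, 0]

Xc = X - X.mean(axis=0)                 # centre each feature

# Eigendecomposition of the covariance matrix gives the loadings.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]       # sort PCs by variance explained
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

a1 = eigvecs[:, 0]                      # loadings a_11, ..., a_p1 of PC1
Z1 = Xc @ a1                            # first principal component Z_1

# PC scores are mutually uncorrelated, and their variances (the
# eigenvalues) decrease from one PC to the next.
Z = Xc @ eigvecs
print(np.round(np.corrcoef(Z, rowvar=False), 6))
print(np.round(Z.var(axis=0, ddof=1), 3))
```

Projecting onto all eigenvectors at once (`Xc @ eigvecs`) is exactly the "rotation of the data" shown on the previous slide: the cloud of points is unchanged, only the axes are new.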